This is the second project of the edwisor Data Science career path.
This project is a supervised learning classification problem. The goal is to predict the number of hours an employee will be absent and to explore the factors that correlate strongly with a high rate of absenteeism.
The data belongs to a courier company. In a fast-paced business environment, customer satisfaction is the first priority, and it depends largely on employee performance. Poor employee performance leads to dissatisfied customers, which in turn affects the company's revenue, and absenteeism is one of the reasons. Excessive absence can seriously affect the organization, leading to lower productivity and higher costs. The factors that cause absenteeism therefore need to be identified, and measures should be taken to address them. This project targets the factors that most affect employee performance.
The data contains information related to employee health, drinking habits, smoking habits, the number of children they have, how much they spend on transportation, their age, and so on. Our aim is to study these attributes and how they relate to an employee's performance.
This analysis leads to a better understanding of these factors and helps the organization's human resources department improve its processes by taking steps to minimize them.
Importing the required libraries
The given data has 740 observations and 21 features.
The features are as follows:
Feature | Description |
---|---|
ID | Id of the Employee |
Reason for absence | Reason given by the employee, e.g. a health issue, family-related reasons, etc. |
Month of absence | Month in which the absence occurred |
Day of the week | Weekday of the absence (Monday to Friday) |
Seasons | Season |
Transportation expense | How much they spend on travelling from home to work |
Distance from Residence to Work | Distance from residence to work in km |
Service time | Length of service with the company |
Age | Age of the Employee |
Work load Average/day | Employee daily average workload |
Hit target | Achievement of work targets |
Disciplinary failure | Whether the employee has had a disciplinary failure |
Education | Employee's highest education level |
Son | Number of children they have |
Social drinker | Whether the employee is a social drinker |
Social smoker | Whether the employee is a social smoker |
Pet | Number of pets they have |
Weight | Weight of an Employee |
Height | Height of an Employee |
Body mass index | Body mass index, calculated as weight / height² |
Absenteeism time in hours | The Number of hours an employee was absent |
Absenteeism time in hours is our target variable.
#Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Importing data
EmpAb = pd.read_excel("https://s3-ap-southeast-1.amazonaws.com/edwisor-india-bucket/projects/data/DataN0101/Absenteeism_at_work_Project.xls")
EmpAb.shape
EmpAb.head()
EmpAb.describe()
EmpAb.columns
EmpAb.dtypes
EmpAb
Before doing any analysis we need to prepare the data for better understanding.
First we need to identify the categorical and continuous features and separate them.
The numeric codes of the categorical features should then be replaced with their corresponding labels.
Changing the numeric codes in Reason for absence to the corresponding reason descriptions.
Replacing 1, 2, 3, 4 in the Seasons feature with Summer, Autumn, Winter, Spring.
Replacing 1, 2, 3, 4 in the Education feature with High School, Graduate, Postgraduate, Master and Doctor.
Categorical = ['ID','Reason for absence','Month of absence','Day of the week',
'Seasons','Son','Pet','Disciplinary failure','Education',
'Social drinker','Social smoker']
Continuous = ['Transportation expense','Distance from Residence to Work',
'Service time','Age','Work load Average/day ','Hit target','Weight',
'Height','Body mass index','Absenteeism time in hours']
#Separating variables into categorical and continuous
#Copying the data into new dataset for data analysis
data = EmpAb.copy()
data['ID'] = data['ID'].astype('category')
data['Reason for absence'] = data['Reason for absence'].replace([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28],
['Undefined absence',
'Certain infectious and parasitic diseases',
'Neoplasms',
'Diseases of the blood, blood-forming organs and immune mechanism disorders',
'Endocrine, nutritional and metabolic diseases',
'Mental and behavioural disorders',
'Diseases of the nervous system',
'Diseases of the eye and adnexa',
'Diseases of the ear and mastoid process',
'Diseases of the circulatory system',
'Diseases of the respiratory system',
'Diseases of the digestive system',
'Diseases of the skin and subcutaneous tissue',
'Diseases of the musculoskeletal system and connective tissue',
'Diseases of the genitourinary system',
'Pregnancy, childbirth and the puerperium',
'Certain conditions originating in the perinatal period',
'Congenital malformations, deformations and chromosomal abnormalities',
'Symptoms, signs, abnormal clinical and laboratory findings, not elsewhere classified',
'Injury, poisoning and certain other consequences of external causes',
'External causes of morbidity and mortality',
'Factors influencing health status and contact with health services',
'patient follow-up',
'medical consultation',
'blood donation',
'laboratory examination',
'unjustified absence',
'physiotherapy',
'dental consultation']).astype('category')
data['Month of absence'] = data['Month of absence'].astype('category')
data['Day of the week'] = data['Day of the week'].replace([2,3,4,5,6],
['Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday']).astype('category')
data['Seasons'] = data['Seasons'].replace([1,2,3,4],
['Summer',
'Autumn',
'Winter',
'Spring']).astype('category')
data['Disciplinary failure'] = data['Disciplinary failure'].replace([0,1],
['No',
'Yes']).astype('category')
data['Education'] = data['Education'].replace([1,2,3,4],
['High School',
'Graduate',
'Postgraduate',
'Master and Doctor']).astype('category')
data['Social drinker'] = data['Social drinker'].replace([0,1],
['No',
'Yes']).astype('category')
data['Social smoker'] = data['Social smoker'].replace([0,1],
['No',
'Yes']).astype('category')
data['Son'] = data['Son'].astype('category')
data['Pet'] = data['Pet'].astype('category')
In this part we extract insights from the data to discover patterns and summarize it using different statistical visualization techniques such as bar graphs, count plots, and scatter plots.
Analyses performed:-
#Plotting the number of leaves taken by each employee (by ID)
sns.set_style("whitegrid")
plt.gcf().set_size_inches(10,6)
sns.countplot(data=data,x='ID',)
Employees with ID 3 and 28 took the most leaves.
plt.gcf().set_size_inches(10,6)
sns.barplot(data=data,x='ID',y='Absenteeism time in hours',ci = None,estimator=sum)
But when we total the hours for each ID, we can see that employees with ID 9, 11, 14, 15, 20, 34, and 36 were absent fewer times but for longer hours.
#Absenteeism hours by ID, and whether the employee is a social drinker
plt.gcf().set_size_inches(15,4)
sns.scatterplot(data=data,x='ID',y='Absenteeism time in hours',hue = 'Social drinker',s=80)
#Absenteeism hours by ID, and whether the employee is a social smoker
plt.gcf().set_size_inches(15,4)
sns.scatterplot(data=data,x='ID',y='Absenteeism time in hours',
hue = 'Social smoker',s=80)
#Absenteeism hours by ID, and the employee's educational qualification
plt.gcf().set_size_inches(15,6)
sns.scatterplot(data=data,x='ID',y='Absenteeism time in hours',
hue= 'Education',s=80)
#Plotting the count of each reason for absence
#plt.xticks(rotation=90)
plt.gcf().set_size_inches(10,6)
sns.countplot(data=data,y='Reason for absence')
Most people took leave for medical consultation, dental consultation, and physiotherapy.
People with musculoskeletal problems (joint pain) also took a significant number of leaves.
#distribution of Reason of absence with Absenteeism time in hours
#getting total no. of hours absent per reason
plt.gcf().set_size_inches(10,6)
sns.barplot(data=data,y='Reason for absence',x='Absenteeism time in hours',ci = None,estimator=sum)
People with musculoskeletal problems (joint pain) were absent for the longest hours, followed by external causes.
#Number of absences per month
sns.countplot(data=data,x='Month of absence')
#People took leave mostly in the 3rd month
#Total number of hours absent per month
sns.barplot(data=data,x='Month of absence',y='Absenteeism time in hours',ci = None,estimator=sum)
#Employees took long leaves in 7th month also
#Frequency of weekdays
sns.countplot(data=data,x='Day of the week')
#Employees were absent mostly on Mondays and Tuesdays
#Total no. of hours absent per weekday
sns.barplot(data=data,x='Day of the week',y='Absenteeism time in hours',ci = None,estimator=sum)
#Same as the frequency plot: employees were absent mostly on Monday and Tuesday, and for long hours
#Frequency of leaves season wise
sns.countplot(data=data,x='Seasons')
#People took most of their leaves in spring
#Total no. of hours absent season wise
sns.barplot(data=data,x='Seasons',y='Absenteeism time in hours',ci = None,estimator=sum)
#From the above and below graphs we can conclude that
#employees took short leaves in spring and autumn but long leaves in winter
#Count plot of disciplinary failure
sns.countplot(data=data,x='Disciplinary failure')
plt.show()
#There are very few employees with a disciplinary failure
#Disciplinary failure with absence hours plot
sns.barplot(data=data,x='Disciplinary failure',y='Absenteeism time in hours',ci = None,estimator=sum)
#It seems people with no disciplinary failure were more absent
#No. of leaves took education wise
sns.countplot(data=data,x='Education')
#Employees with a high school education took the most leaves
#Total no. of hours absent education wise
sns.barplot(data=data,x='Education',y='Absenteeism time in hours',ci = None,estimator = sum)
#High-school educated employees account for the most absence hours
#Count of leaves by social drinker status
sns.countplot(data=data,x='Social drinker')
#People with a social drinking habit were absent more often
#Absent hours plot with Social drinker
sns.barplot(data=data,x='Social drinker',y='Absenteeism time in hours',ci = None,estimator = sum)
#People with a drinking habit were also absent for longer periods
#Plot of social smoker
sns.countplot(data=data,x='Social smoker')
#There are very few employees with a smoking habit
#Plot of Social smoker with absent hours
sns.barplot(data=data,x='Social smoker',y='Absenteeism time in hours',ci = None,estimator = sum)
#Here people with no smoking habit took long hours of leave
#Frequency of leaves by the number of children (Son) employees have
sns.countplot(data=data,x='Son')
#People with no children took leaves most frequently
#Total number of hours absent by number of children
sns.barplot(data=data,x='Son',y='Absenteeism time in hours',ci = None,estimator = sum)
#People with 2 or 0 children account for the most absence hours
for i in range(0,5):
    a = data[data['Son'] == i]
    print(i, a['Absenteeism time in hours'].sum())
for i in range(0,5):
    a = data[data['Son'] == i]
    print(i, a['Absenteeism time in hours'].mean())
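The same totals and averages can also be produced in one step with a groupby; a minimal sketch of the equivalent summary:
#Total and mean absence hours by number of children, using groupby
print(data.groupby('Son')['Absenteeism time in hours'].agg(['sum', 'mean']))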
#Frequency of leaves by the number of pets employees have
sns.countplot(data=data,x='Pet')
#People with no Pet took leaves more frequently
#Average hours absent by number of pets
sns.barplot(data=data,x='Pet',y='Absenteeism time in hours',ci = None)
#People with 1 and 4 pets have the highest average absence hours
#Frequency of leaves taken by age
plt.gcf().set_size_inches(10,4)
sns.countplot(data=data,x='Age')
#Employees aged 28 and 38 took leaves most frequently
#Total number of hours absent by age
plt.gcf().set_size_inches(10,4)
sns.barplot(data=data,x='Age',y='Absenteeism time in hours',ci = None,estimator = sum)
#Employees aged 28 and 36 account for the most absence hours
Distribution of Transportation expense, Distance from Residence to Work, Service time, Age, Work load Average/day, Hit target, Weight, Height, and Body mass index against Absenteeism time in hours.
#Relation of continuous variable with target variable
sns.pairplot(data = data,
x_vars = Continuous[0:-1],
y_vars = Continuous[-1])
#Distribution of continuous variable
data[Continuous].hist(bins = 20,figsize = (15,10))
Our data contains some missing values. These might have occurred due to human error or because the data collection was not done properly, and missing data can produce inaccurate results, so it must be handled first.
We first check whether any feature has more than 30 percent missing values; if so, that feature should be dropped. We then check for values equal to zero in 'Reason for absence', 'Month of absence', and 'Absenteeism time in hours', because observations for these features cannot be zero.
To impute the missing values we can use the mean, median, or mode. Here the mode is used for the categorical features and the mean for the continuous ones, and 'Body mass index' is recalculated using the BMI formula.
print(len(EmpAb[EmpAb['Reason for absence'] == 0]))
print(len(EmpAb[EmpAb['Month of absence'] == 0]))
print(len(EmpAb[EmpAb['Absenteeism time in hours'] == 0]))
#The dataset has some values equal to zero in Reason for absence, Month of absence, and Absenteeism time in hours
#These values cannot be zero
#So they are replaced with NaN and imputed afterwards
EmpAb['Reason for absence'] = EmpAb['Reason for absence'].replace(0,np.nan)
EmpAb['Month of absence'] = EmpAb['Month of absence'].replace(0,np.nan)
EmpAb['Absenteeism time in hours'] = EmpAb['Absenteeism time in hours'].replace(0,np.nan)
#Replacing 0 with NaN in Reason for absence, Month of absence, and Absenteeism time in hours
missingValues = pd.DataFrame(EmpAb.isnull().sum(),columns = ['No. of missing values'])
for c in EmpAb.columns:
    missingValues.loc[c,'Percent'] = (EmpAb[c].isnull().sum()/len(EmpAb))*100
missingValues
#Checking the missing number and percentage of missing values
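As noted above, any feature with more than 30 percent missing values should be dropped before imputation. None of the features in this data cross that threshold, but a minimal sketch of that check could look like this:
#Drop any feature whose share of missing values exceeds 30 percent (none do here)
colsToDrop = missingValues[missingValues['Percent'] > 30].index
EmpAb = EmpAb.drop(columns = colsToDrop)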
for col in Categorical:
    EmpAb[col] = EmpAb[col].fillna(EmpAb[col].mode().values[0])
for col in Continuous:
    if col == 'Body mass index': continue
    EmpAb[col] = EmpAb[col].fillna(EmpAb[col].mean())
#Imputing missing values except 'Body mass index'
#Replacing categorical values with mode
#And replacing numerical values with mean
BMI = 'Body mass index'
EmpAb[BMI]=EmpAb[BMI].fillna(EmpAb['Weight']/np.square(EmpAb['Height']))
#'Body mass index' is filled using the BMI = Weight/(Height)^2
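#Note (assumption): Height appears to be recorded in centimetres; in that case the
#standard BMI formula would need the height in metres, e.g.
# EmpAb[BMI] = EmpAb[BMI].fillna(EmpAb['Weight']/np.square(EmpAb['Height']/100))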
#Changing categorical datatype
for cat in Categorical:
    EmpAb[cat] = EmpAb[cat].astype('category')
EmpAb.isnull().sum().sum()
#Checking if any missing value left
#We can also use KNN imputation for missing value analysis
# from fancyimpute import KNN
# EmpAb = pd.DataFrame(KNN(k = 3).fit_transform(EmpAb), columns = EmpAb.columns)
# for cat in Categorical:
# EmpAb[cat] = EmpAb[cat].round()
# EmpAb[cat] = EmpAb[cat].astype('category')
#Changing datatypes of categorical variables
Outliers are extreme values that lie far from the common observations and from the mean of the data. These values can heavily influence the model and affect prediction accuracy to a large extent.
First we need to detect these outliers and either remove them or impute them with meaningful values.
#Detection of outliers
for i in Continuous:
    plt.gcf().set_size_inches(10,3)
    sns.boxplot(data = EmpAb, x = i)
    plt.show()
The interquartile range (IQR) method is used to detect and remove outliers.
#creating dataset with outliers for further evaluation
EmpAbWithOutliers = EmpAb.copy()
#Detecting outliers using the IQR method
#Replacing them with NaN
for i in Continuous:
    q75,q25 = np.percentile(EmpAb[i],[75,25])
    iqr = q75-q25
    min_bar = (q25-(1.5*iqr))
    max_bar = (q75+(1.5*iqr))
    EmpAb.loc[EmpAb[i]<min_bar,i] = np.nan
    EmpAb.loc[EmpAb[i]>max_bar,i] = np.nan
#Filling values with mean
for col in Continuous:
    EmpAb[col] = EmpAb[col].fillna(EmpAb[col].mean())
#checking if any missing value left
EmpAb.isna().sum().sum()
#we can also use KNN to impute outlier values
# from fancyimpute import KNN
# EmpAb = pd.DataFrame(KNN(k = 3).fit_transform(EmpAb), columns = EmpAb.columns)
This is the most important part, where the features that play an important role in predicting the target variable are separated from the less important ones.
From the correlation heatmap we can see that Weight is highly correlated with Body mass index.
#Plotting corelational matrix of continuous variables
CorrMat = EmpAb[Continuous].corr()
plt.gcf().set_size_inches(10,8)
sns.heatmap(CorrMat,annot =True)
#Here we can see that Weight and Body mass index are highly correlated, so one of them needs to be dropped
#Removing the variable Weight
EmpAb = EmpAb.drop(columns=['Weight'])
EmpAbWithOutliers = EmpAbWithOutliers.drop(columns=['Weight'])
# Continuous.remove('Weight')
The data has very different scales across its features. For example, the average daily workload is in the thousands while age is in the tens, so the values need to be brought to a common scale before being fed into machine learning algorithms.
To do this, min-max normalization is used: the feature's minimum value is subtracted from each observation, and the result is divided by the feature's maximum minus its minimum. This scales every observation to between 0 and 1, and is applied only to the continuous features.
#Normalizing the values of continuous variables
def DoNormalization(data):
    for var in Continuous:
        if var == 'Absenteeism time in hours' or var == 'Weight': continue
        data[var] = (data[var]-data[var].min())/((data[var].max())-(data[var].min()))
    return data
# EmpAb = DoNormalization(EmpAb)
# EmpAbWithOutliers = DoNormalization(EmpAbWithOutliers)
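The same min-max scaling can also be done with scikit-learn's MinMaxScaler; a minimal sketch, assuming only the continuous predictors are scaled:
#Alternative min-max scaling with scikit-learn (kept separate; DoNormalization is what is used below)
from sklearn.preprocessing import MinMaxScaler
scaleCols = [c for c in Continuous if c not in ('Absenteeism time in hours', 'Weight')]
scaledData = EmpAb.copy()
scaledData[scaleCols] = MinMaxScaler().fit_transform(scaledData[scaleCols])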
Dividing the data set into 80% training data and 20% test data.
#Function to create a random train/test split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
def CreateSample(Data):
    Target = 'Absenteeism time in hours'
    InputData = Data.loc[:,Data.columns != Target]
    InputLabel = Data[Target]
    return train_test_split(InputData,InputLabel,test_size=0.2)
Now the cleaned, formatted, and split data is fed to different machine learning algorithms to build models that can predict the target value for new input data with high accuracy, and different evaluation metrics are used to assess the predictive accuracy of these models.
Two sets of functions are created, one for regression and one for classification modelling. Why is regression modelling needed at all if this is a classification problem?
The target variable is not normally distributed; it has two peaks with similar frequencies at two different values.
So the evaluation is first done on the regression models, and the problem is then converted to classification using binning, in which categories are created by dividing the target values.
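For illustration, a small sketch of quantile-based binning with pd.qcut (the values below are made up, not taken from the dataset):
#pd.qcut splits a numeric series into roughly equally populated categories
exampleHours = pd.Series([1, 2, 2, 3, 4, 8, 8, 16, 24, 40])
exampleBins = pd.qcut(exampleHours, 2, labels = ['short absence', 'long absence'])
print(exampleBins.value_counts())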
import warnings
warnings.filterwarnings("ignore")
#Functions for modelling and evaluation for regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
modelsRegression = []
modelsRegression.append(('LR ',LinearRegression()))
modelsRegression.append(('DTR',DecisionTreeRegressor()))
modelsRegression.append(('RFR',RandomForestRegressor()))
modelsRegression.append(('KNN',KNeighborsRegressor()))
modelsRegression.append(('LSO',Lasso()))
#modelsRegression.append(('SVR',SVR()))
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
def MAPE(actual,predicted): return np.mean((abs(actual-predicted))/actual)*100
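#Note: MAPE is defined above for reference but is not called in the evaluation function
#below; it is also undefined whenever an actual value equals 0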
def ModellingAndEvaluationRegression(xTrain,xTest,yTrain,yTest):
    MeanAbsoluteErrors = []
    MeanSquaredErrors = []
    RSquaredValue = []
    for name,model in modelsRegression:
        model.fit(xTrain,yTrain)
        predict = model.predict(xTest)
        MeanAbsoluteErrors.append((name,mean_absolute_error(yTest,predict)))
        MeanSquaredErrors.append((name,mean_squared_error(yTest,predict)))
        RSquaredValue.append((name,r2_score(yTest,predict)))
    print('Mean Absolute Errors:-')
    for name, score in MeanAbsoluteErrors: print(name,':',score)
    print()
    print('Mean Squared Errors:-')
    for name, score in MeanSquaredErrors: print(name,':',score)
    print()
    print('R Squared Value:-')
    for name, score in RSquaredValue: print(name,':',score)
#Functions for modelling and evaluation for classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
modelsClassification = []
modelsClassification.append(('LR ',LogisticRegression()))
modelsClassification.append(('DTC',DecisionTreeClassifier()))
modelsClassification.append(('RFC',RandomForestClassifier()))
modelsClassification.append(('KNN',KNeighborsClassifier()))
modelsClassification.append(('GNB',GaussianNB()))
#modelsClassification.append(('SGD',SGDClassifier()))
#modelsClassification.append(('SVC',SVC()))
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
def ModellingAndEvaluationClassification(xTrain,xTest,yTrain,yTest):
    ConfusionMatrices = []
    AccuracyScores = []
    ClassificationReports = []
    for name,model in modelsClassification:
        model.fit(xTrain,yTrain)
        predict = model.predict(xTest)
        ConfusionMatrices.append((name,confusion_matrix(yTest,predict)))
        AccuracyScores.append((name,accuracy_score(yTest,predict)))
        ClassificationReports.append((name,classification_report(yTest,predict)))
    print('Confusion Matrices:-')
    for name, score in ConfusionMatrices:
        print(name,':')
        print(score)
        print()
    print('Accuracy Scores:-')
    for name, score in AccuracyScores:
        print(name,':',score)
    print()
    print('Classification Reports:-')
    for name, score in ClassificationReports:
        print(name,':')
        print(score)
        print()
. | Class 1 (predicted) | Class 2 (predicted) |
---|---|---|
Class 1 (actual) | TP | FN |
Class 2 (actual) | FP | TN |

From this layout, accuracy = (TP + TN) / (TP + FN + FP + TN), which is the accuracy score reported for each classifier below.
First the regression models are evaluated using the above metrics, but the results show that the models underfit the data. The R² value is even negative, which means the model performs worse than simply predicting the mean.
After experimenting with different techniques, the target is binned into categories instead, and from the scores we can see that RandomForestClassifier gives about 72 percent accuracy.
#Normal evaluation of dataset using regression
DataNrml = EmpAb.copy()
DataNrml = DoNormalization(DataNrml)
xTrain,xTest,yTrain,yTest = CreateSample(DataNrml)
ModellingAndEvaluationRegression(xTrain,xTest,yTrain,yTest)
DataVar = EmpAb.copy()
for i in DataVar.columns:
    sns.distplot(DataVar[i])
    plt.show()
#Manually removing some variables that are not uniformly distributed
DataVar = DataVar.drop(columns= ['Seasons',
'Disciplinary failure','Education',
'Social drinker','Social smoker','Pet','Height'])
DataVar = DoNormalization(DataVar)
xTrain,xTest,yTrain,yTest = CreateSample(DataVar)
ModellingAndEvaluationRegression(xTrain,xTest,yTrain,yTest)
DataCbrt = EmpAb.copy()
#Taking the cube root of the target variable to get a nearly uniform distribution curve
DataCbrt['Absenteeism time in hours'] = np.cbrt(DataCbrt['Absenteeism time in hours'])
DataCbrt = DoNormalization(DataCbrt)
xTrain,xTest,yTrain,yTest = CreateSample(DataCbrt)
ModellingAndEvaluationRegression(xTrain,xTest,yTrain,yTest)
# DataCls = EmpAbWithOutliers.copy()
DataCls = EmpAb.copy()
sns.distplot(DataCls['Absenteeism time in hours'])
DataCls = DataCls.drop(columns= ['Seasons',
'Disciplinary failure','Education',
'Social drinker','Social smoker','Son','Pet','Height'])
#Turning the problem into classification as regression is not giving good results
DataCls = DataCls[DataCls['Absenteeism time in hours']<=10]#.round()
#DataCls['Absenteeism time in hours'] = DataCls['Absenteeism time in hours'].astype('category')
DataCls['Absenteeism time in hours'].value_counts()
#Dividing the target variable into 2 categories using quantile binning
DataCls['Absenteeism time in hours'] = pd.qcut(DataCls['Absenteeism time in hours'],2,labels = False)
DataCls['Absenteeism time in hours'].value_counts()
DataCls['Absenteeism time in hours'] = DataCls['Absenteeism time in hours'].astype('category')
#Run classification on new dataset
DataCls = DoNormalization(DataCls)
xTrain,xTest,yTrain,yTest = CreateSample(DataCls)
ModellingAndEvaluationClassification(xTrain,xTest,yTrain,yTest)
#Here RandomForestClassifier seems to give the best result, nearly 70%
After experimenting with different regression and classification algorithms, it is concluded that most of the features have a bimodal distribution and the target variable is bimodal as well. So the problem is converted to a 2-class classification problem, and RandomForestClassifier gives 78% accuracy.
DataCls = EmpAb.copy()
# DataCls = EmpAbWithOutliers.copy()
#Turning the problem into classification as regression is not giving good results
DataCls['Absenteeism time in hours'] = DataCls['Absenteeism time in hours'].round()
DataCls['Absenteeism time in hours'].value_counts()
#Dividing the target variable into 2 categories using quantile binning
DataCls['Absenteeism time in hours'] = pd.qcut(DataCls['Absenteeism time in hours'],2,labels = [1,2])
#Run classification on new dataset
DataCls = DoNormalization(DataCls)
xTrain,xTest,yTrain,yTest = CreateSample(DataCls)
ModellingAndEvaluationClassification(xTrain,xTest,yTrain,yTest)
#Here RandomForestClassifier seems to give the best result, nearly 78%
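As a follow-up sketch (not part of the original evaluation), the random forest's feature importances can be inspected to relate this accuracy back to the factors that drive absenteeism:
#Fit a RandomForestClassifier on the same split and list the most influential features
rf = RandomForestClassifier()
rf.fit(xTrain, yTrain)
importances = pd.Series(rf.feature_importances_, index = xTrain.columns)
print(importances.sort_values(ascending = False))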
Because of the high rate of absenteeism, the organization suffers a loss in revenue. Here the total loss due to absenteeism is calculated month by month.
LossVar = ["Month of absence","Work load Average/day ","Service time","Absenteeism time in hours"]
Loss = EmpAbWithOutliers[LossVar].copy()
Loss["Loss"]=(Loss["Work load Average/day "]/Loss["Service time"])*Loss["Absenteeism time in hours"]
MonthlyLoss = Loss[["Month of absence","Loss"]].copy()
MonthlyLoss['Loss'] = MonthlyLoss['Loss'].astype('int')
MonthlyLoss.groupby("Month of absence").sum()
sns.barplot(data=MonthlyLoss,x='Month of absence',y='Loss',ci = None,estimator=sum)